PerceptionLM #37878
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
Yay, super excited to get the model shipped! I know it is early to review, but I noticed the model doesn't have a modular file yet. I recommend using modular transformers to add the model; it will allow you to inherit from any similar model in transformers and you won't have to rewrite the whole class.
It also makes the review process easier and faster, since we can see the main differences between PE and other existing models 😉
I see. Let me take a look. The classes here were added automatically (first commit: plm template) from running this command:
Yeah, that way is correct and usually copies an existing similar model. In this case Llava was written without modular as it was almost the first VLM in transformers :)
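For illustration, a minimal modular file might start out roughly like this (a sketch only, assuming inheritance from Llava; the actual modular_perception_lm.py will differ):

```python
# modular_perception_lm.py (sketch): inherit from an existing similar model and only
# override what differs; the modular converter then generates modeling_perception_lm.py.
from transformers.models.llava.modeling_llava import (
    LlavaForConditionalGeneration,
    LlavaModel,
)


class PerceptionLMModel(LlavaModel):
    pass


class PerceptionLMForConditionalGeneration(LlavaForConditionalGeneration):
    pass
```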
Thanks for the great work @shuminghu!
My main concern currently is to make the image processing and the vision tower more aligned with transformers standards. I left some comments below on how we could load the model, lmk if that works
    Model tester for `PerceptionLMForConditionalGeneration`.
    """

    all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
We should add `all_generative_model_classes` as well; fast generation tests are needed to make sure we don't break anything in the future. We don't normally run integration tests for all PRs.
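For illustration, a minimal sketch of the relevant tester attributes (the surrounding mixins and imports come from the existing test file and are assumed here):

```python
class PerceptionLMForConditionalGenerationModelTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
    """Sketch of the attributes in question, not the full tester."""

    all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
    # Registering the generative class enables the fast generation tests from GenerationTesterMixin.
    all_generative_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
```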
@shuminghu ping me when it is ready for review, and can you mark the PR as "ready for review" as well?
Yes. Absolutely. Circling back to this now.
Great job, super clean! ❤️ I left mostly comments about nits, like deleting unnecessary files and print statements.
The only major comment to address is about moving the contents of image_transform.py to the processing file, and making sure it follows the transformers format style (i.e. uses torchvision.F without Compose).
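For illustration, the transformers-style approach applies torchvision's functional ops directly instead of building a Compose pipeline (a sketch with made-up size and normalization values, not PerceptionLM's actual preprocessing):

```python
import torch
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as F


def preprocess(image: torch.Tensor) -> torch.Tensor:
    # Each step is an explicit functional call rather than an entry in a Compose list.
    image = F.resize(image, [448, 448], interpolation=InterpolationMode.BICUBIC)
    image = F.convert_image_dtype(image, torch.float32)
    image = F.normalize(image, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
    return image
```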
## Overview

The PerceptionLM model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>.
<INSERT SHORT SUMMARY HERE>
docs to be filled before merging
        use_cls_token=True,
        architecture="vit_pe_core_large_patch14_336",
        width=1024,
        img_size=[448, 448],
We can't use mutables as defaults for args in Python; same for `ref_feat_shape`. Let's make it a tuple.
I changed them from tuples because a unit test failed:
tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_config FAILED [ 12%]
============================================================================= short test summary info ==============================================================================
FAILED tests/models/perception_lm/test_modeling_perception_lm.py::PerceptionLMForConditionalGenerationModelTest::test_config - AssertionError: {'ima[1414 chars]'', 'model_type': 'perception_encoder', 'use_c[3584 chars]alse} != {'ima[1414 chars]'', 'use_cls_token': True, 'architecture': 'vi[3584 chars]...
=========================================================================== 1 failed, 1 warning in 1.19s ============================
It turns out a config serialized from JSON produces a list, which is not equal to the original tuple.
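A quick standalone illustration of the mismatch (not the actual config test):

```python
import json

original = {"img_size": (448, 448)}
restored = json.loads(json.dumps(original))

print(restored["img_size"])  # [448, 448] — JSON has no tuple type, so it comes back as a list
print(restored == original)  # False, because [448, 448] != (448, 448)
```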
We can initialize the default afterwards, as in `img_size = img_size if img_size is not None else [448, 448]`.
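A minimal sketch of that pattern (the class name and the `ref_feat_shape` default here are illustrative, not the actual PerceptionLM config):

```python
class VisionConfigSketch:
    def __init__(self, img_size=None, ref_feat_shape=None):
        # Keep `None` as the signature default and fill in the real value here,
        # so no mutable object is shared across instances.
        self.img_size = img_size if img_size is not None else [448, 448]
        self.ref_feat_shape = ref_feat_shape if ref_feat_shape is not None else [32, 32]
```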
import torch


TEST_MODEL_PATH = "shumingh/plm_1b_hf"
To be updated to the actual checkpoint after the model is released and public.
        # input_ids[
        #     :, self.num_image_tokens : self.num_video_tokens + self.num_image_tokens
        # ] = config.video_token_id
`video_inference` doesn't work? We would need tests for all modalities.
we don't support mixing image and video in the same prompt
Then we can add a separate tester for video only, so it is tested.
Added back. It doesn't matter here.
    pass


TEST_MODEL_PATH = "shumingh/plm_1b_hf"
Same, to be updated before merge so we don't forget.
Thanks for iterating! Approving as the PR looks good, left only minor comments for readability. Let's make the CI green and ask for last review from core maintainers
For the failing integration tests where a timm model isn't found, cc @ydshieh. Which version of timm do we have in the runners, and can we update it? We'll also have another big release soon based on timm as a backbone.
        thumbnails, _ = self.resize(stacked_images, self.tile_size, max_num_tiles=1)
        images_for_tiling, (tiles_w, tiles_h) = self.resize(
            stacked_images, self.tile_size, max_num_tiles=self.max_num_tiles
        )
We should be able to configure `tile_size` at call time, and it is guaranteed to be in the input args. So we need to expand the signature args and use `tile_size` instead of `self.tile_size`. Same for `max_num_tiles`.
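A rough sketch of the suggested signature change (the method name and surrounding code are assumptions based on the snippet above, not the final implementation):

```python
def _preprocess_images(self, stacked_images, tile_size: int, max_num_tiles: int, **kwargs):
    # Use the values passed at call time instead of self.tile_size / self.max_num_tiles,
    # so callers can override them per call.
    thumbnails, _ = self.resize(stacked_images, tile_size, max_num_tiles=1)
    images_for_tiling, (tiles_w, tiles_h) = self.resize(
        stacked_images, tile_size, max_num_tiles=max_num_tiles
    )
    ...
```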
Hi, it doesn't seem to be just the integration tests but all tests, if we are talking about those failures.
In the runner, we have timm installed (this info can be found in the job step). However, we can not use a model like that in the non-integration tests; for integration tests, we can use it.
Right, we need to modify the non-integration tests @shuminghu. And for the integration ones, do we need to add a decorator for the timm version so it doesn't flood nightlies with failures, or do we update timm in the runners?
`vit_pe_core_large_patch14_336` here is the architecture template, but the actual number of layers and dims are small numbers in the non-integration tests, not ViT-L scale.
@ydshieh
Happy to upgrade timm in docker if @zucchini-nlp and @qubvel confirm it is necessary. Does this mean we also have to pin the timm version? (And is this requirement specific to this new model, or do we need it anyway in general?)
We also need the latest timm for gemma3n, because mobilenetv5 was added just recently. Not sure if we should pin reqs strictly to the latest version, though; maybe just add a version check with a correct error message.
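For example, a version check along these lines could work (a sketch; the minimum version and error message are placeholders, using transformers' existing `require_version` helper):

```python
from transformers.utils.versions import require_version

# Placeholder minimum version: fail early with a clear message instead of a cryptic
# "architecture not found" error when the installed timm is too old.
require_version(
    "timm>=1.0.16",
    "PerceptionLM's vision tower needs a recent timm release; please run `pip install -U timm`.",
)
```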
We have timm==1.0.15 at this moment (and v1.0.16 was released 16 hours ago). @shuminghu Are the failing tests caused by the timm version?
@ydshieh Right. v1.0.16 would be good for me for CI, going by the Release v1.0.16 change log. --- update ---
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Alright! Great great work, super clean 🤗🤗 Just a few comments to polish a bit, but this is very nice already and almost ready to be shipped!👌
from transformers.generation.utils import GenerationMixin
Let's be consistent and use relative imports everywhere!
Done.
_CONFIG_FOR_DOC = "PerceptionLMConfig"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/Perception-LM-1B"
These can be removed as they do not seem to be used (they were mostly here before we had `auto_docstring`).
Removed.
class AdaptiveAvgPooling(nn.Module):
    def __init__(self, pooling_ratio=2):
        super(AdaptiveAvgPooling, self).__init__()
Suggested change:
-class AdaptiveAvgPooling(nn.Module):
-    def __init__(self, pooling_ratio=2):
-        super(AdaptiveAvgPooling, self).__init__()
+class PerceptionLMAdaptiveAvgPooling(nn.Module):
+    def __init__(self, pooling_ratio=2):
+        super().__init__()
Done.
        self.pooling = (
            AdaptiveAvgPooling(config.projector_pooling_ratio) if config.projector_pooling_ratio > 1 else nn.Identity()
        )
Do we have checkpoints where it's actually > 1? I see it's 1 by default in the config. If it's not used in any real checkpoint, let's remove this layer for simplicity!
Yes, we use pooling ratio 2:
https://huggingface.co/facebook/Perception-LM-1B/blob/main/params.json#L19
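For intuition, a pooling ratio of 2 roughly halves each side of the vision token grid; a standalone sketch with made-up shapes (not the model's actual implementation):

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 32 * 32, 768)                  # (batch, patch tokens, hidden dim)
grid = hidden.transpose(1, 2).reshape(1, 768, 32, 32)  # back to a 2D grid of tokens
pooled = nn.AdaptiveAvgPool2d(32 // 2)(grid)           # pooling_ratio=2 -> 16x16 grid
pooled_tokens = pooled.flatten(2).transpose(1, 2)      # (1, 256, 768): 4x fewer tokens
print(pooled_tokens.shape)
```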
        self.projector = nn.ModuleList(
            [
                nn.Linear(
                    in_features=input_size,
                    out_features=output_size,
                    bias=True,
                ),
                nn.GELU(),
                nn.Linear(
                    in_features=output_size,
                    out_features=output_size,
                    bias=True,
                ),
            ]
        )
Here, if possible, I'd rather we unwrap those layers and use them directly in the forward as well, instead of iterating, for clarity (this would require updating the converter as well).
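A rough sketch of what the unwrapped version could look like (class and attribute names are assumptions, not the final implementation):

```python
import torch
import torch.nn as nn


class PerceptionLMMultiModalProjectorSketch(nn.Module):
    def __init__(self, input_size: int, output_size: int):
        super().__init__()
        # Named layers instead of an nn.ModuleList, so the forward pass reads top to bottom.
        self.linear_1 = nn.Linear(input_size, output_size, bias=True)
        self.gelu = nn.GELU()
        self.linear_2 = nn.Linear(output_size, output_size, bias=True)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        hidden = self.linear_1(features)
        hidden = self.gelu(hidden)
        return self.linear_2(hidden)
```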
Done.
    # Make modules available through conditional class for BC
    @property
    def language_model(self):
        return self.model.language_model

    @property
    def vision_tower(self):
        return self.model.vision_tower

    @property
    def multi_modal_projector(self):
        return self.model.multi_modal_projector
We should not need those
Removed.
Oh, I added this back (for BC). It was added to fix this test:
@require_torch_sdpa
def test_sdpa_can_dispatch_composite_models(self):
    """
    Tests if composite models dispatch correctly on SDPA/eager when requested so when loading the model.
    This tests only by looking at layer names, as usually SDPA layers are called "SDPAAttention".
    In contrast to the above test, this one checks if the "config._attn_implementation" is a dict after the model
    is loaded, because we manually replicate requested attn implementation on each sub-config when loading.
    See https://github.com/huggingface/transformers/pull/32238 for more info
    The test tries to cover most general cases of composite models, VLMs with vision and text configs. Any model
    that has a different set of sub-configs has to overwrite this test.
    """
    if not self.has_attentions:
        self.skipTest(reason="Model architecture does not support attentions")
    if not self._is_composite:
        self.skipTest(f"{self.all_model_classes[0].__name__} does not support SDPA")

    for model_class in self.all_model_classes:
        config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
        model = model_class(config)
        with tempfile.TemporaryDirectory() as tmpdirname:
            model.save_pretrained(tmpdirname)
            model_sdpa = model_class.from_pretrained(tmpdirname)
            model_sdpa = model_sdpa.eval().to(torch_device)
            vision_model_names = {"visual", "image_tower", "vision_tower", "vision_model"}
            language_model_names = {"language_model", "model", "text_model"}
>           vision_model_name = [name for name in vision_model_names if hasattr(model_sdpa, name)][0]
E           IndexError: list index out of range
I mentioned test_sdpa_can_dispatch_composite_models in the comment.
@auto_docstring
class PerceptionLMModel(LlavaModel):
LlavaModel has a mapping that we should not need here, and that we should overwrite to empty explicitly
Suggested change:
 @auto_docstring
 class PerceptionLMModel(LlavaModel):
+    _checkpoint_conversion_mapping = {}
Done.
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
Similar as before for `return_dict` with the decorator! Let's remove it everywhere!
Removed `return_dict` from all methods.
    @unittest.skip("ViT PE cannot be tested with meta device")
    def test_can_be_initialized_on_meta(self):
        pass

    @unittest.skip("ViT PE cannot be tested with meta device")
    def test_can_load_with_meta_device_context_manager(self):
        pass
Is that due to timm?
Yes, it is. I mentioned timm in the reason string.
    def test_training(self):
        self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
        super().test_training()

    def test_training_gradient_checkpointing(self):
        self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
        super().test_training_gradient_checkpointing()

    def test_training_gradient_checkpointing_use_reentrant(self):
        self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
        super().test_training_gradient_checkpointing_use_reentrant()

    def test_training_gradient_checkpointing_use_reentrant_false(self):
        self.all_model_classes = (PerceptionLMForConditionalGeneration,) if is_torch_available() else ()
        super().test_training_gradient_checkpointing_use_reentrant_false()
I never remember exactly how pytest runs the tests, but I think that overriding `self.all_model_classes` like this may pollute other tests that use the attribute 🥲 Why do we change it? It should work with the base model as well.
Oh, these are instantiated per test. I verified this earlier by printing `self.all_model_classes` at the beginning of each test when running them all.
The reason for the override is that loss is not defined for the base model (`PerceptionLMModel`); only `PerceptionLMForConditionalGeneration` has `lm_head`, loss, and labels defined, which are used in these tests. I think other models (llava_onevision, ...) chose to skip these tests; I kept them here as I think they are still valid for `PerceptionLMForConditionalGeneration`.
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, perception_lm
This PR implements PerceptionLM released by Meta:
https://github.com/facebookresearch/perception_models